Official Statistics and R

CopenhagenR

Meetup at Statistics Denmark

February 28, 2017

DST_logo

R_logo

Introduction

DST

About Statistics Denmark

  • Statistics Denmark is the central authority on Danish statistics
  • Our mission is to collect, compile and publish statistics on the Danish society
  • Impartial statistics is a prerequisite for freedom of opinion, research and policy decisions taken on a reliable basis
  • Founded in 1850 and currently around 550 employees

Official statistics

  • National Statistical Institutes like Statistics Denmark produce what is often referred to as 'official statistics'
  • Official statistics have to fulfil certain requirements, e.g. professional independence and sound methodology
  • These demands are described in the “European Statistics Code of Practice”

CoP

Outline of talk

  • In this talk we will take you through the production of a typical questionnaire-based statistic
  • We will refer to the CRAN Task View on official statistics which contains a lot of very useful R packages
  • We will end the talk by briefly presenting our technical setup

CRAN Task View


Production of statistics: Headlines

  • Sampling
  • Data collection
  • Editing
  • Imputation
  • Estimation (weights)
  • Aggregate
  • Seasonal adjustment
  • Disclosure control
  • Dissemination products

Generic Statistical Business Process Model

  • GSBPM is a framework for the production of (official) statistics (not a cookbook)


GSBPM processes heavy on methodology


GSBPM 4.1

Description of example

  • Business survey with the objective of estimating last year's (2016) turnover by industry class

  • Variables in data frame

    • id - a unique 8 digit identification number (carries no information)
    • employees - number of employees in 2016 according to register
    • turnover_2015 - the turnover (in 1000 dkk) in 2015 (t-1)
    • region - the five administrative regions of Denmark (NUTS2)
    • industry - standard 10-group industrial classification (NACE)
  • A nice data set in the sense that, besides last year's turnover, there are no missing values (a bit unrealistic, but no need to complicate things)

Read the data

load(file="data/pop.RData")
summary (pop)
       id             employees        turnover_2015      
 Min.   :10000001   Min.   :   10.00   Min.   :        0  
 1st Qu.:10004046   1st Qu.:   13.00   1st Qu.:    13570  
 Median :10008092   Median :   20.00   Median :    27127  
 Mean   :10008092   Mean   :   59.62   Mean   :   133638  
 3rd Qu.:10012137   3rd Qu.:   38.00   3rd Qu.:    68054  
 Max.   :10016182   Max.   :19594.00   Max.   :116177834  
                                       NA's   :8151       
         region                              industry   
 North DK   :1699   Manufacturing                :3517  
 Central DK :3827   Construction                 :2971  
 Southern DK:3561   Trade and transport          :6221  
 Capital    :5275   Information and communication: 940  
 Zealand    :1820   Financial and insurance      : 113  
                    Real estate                  : 222  
                    Other business services      :2198  
table (pop$industry, pop$region)

                                North DK Central DK Southern DK Capital
  Manufacturing                      502       1068         958     609
  Construction                       363        736         708     710
  Trade and transport                607       1353        1410    2144
  Information and communication       59        149          86     608
  Financial and insurance              7         13          14      71
  Real estate                         22         50          38      91
  Other business services            139        458         347    1042

                                Zealand
  Manufacturing                     380
  Construction                      454
  Trade and transport               707
  Information and communication      38
  Financial and insurance             8
  Real estate                        21
  Other business services           212

Goal of estimation

We want to estimate turnover by industry in 2016

Observation: Distribution of 2015 turnover is extremely skewed to the right…

plot(ecdf(pop$turnover_2015),
     main="Distribution of turnover in year t-1",
     xlab="Turnover [1000dkk]",
     col="blue")

plot of chunk unnamed-chunk-4

plot(ecdf(log(pop$turnover_2015+0.0001)),
     main="Distribution of log(turnover in year t-1)",
     xlab="log(Turnover)",
     col="blue")

plot of chunk unnamed-chunk-5

Correlations

Correlation between the two numerical variables in the data

cor (pop[c("employees", "turnover_2015")], use="pairwise.complete.obs")
              employees turnover_2015
employees     1.0000000     0.4388757
turnover_2015 0.4388757     1.0000000

A bit difficult to see due to the skewed distribution

plot (pop$employees, pop$turnover_2015, col="blue")

plot of chunk unnamed-chunk-7

Positive correlation visible with log-log

plot (log(pop$employees), log(pop$turnover_2015), col="blue")

plot of chunk unnamed-chunk-8

Assumption: Turnover at time \( t \) is highly correlated with turnover at \( t-1 \), hence we use turnover in 2015 to determine needed sample size.

Sample size

We characterise the distribution of turnover_2015:

(m_x <- mean (pop$turnover_2015, na.rm=TRUE))
[1] 133638.3
(s_x <- sd (pop$turnover_2015, na.rm=TRUE))
[1] 1444230

A standard error (se) of 2% of the mean corresponds to a 95% confidence half-width of

m_x * 0.02 * 1.96
[1] 5238.622

So, say we want a 95% confidence interval for the mean to be at most 134,000 +/- 5,000, then we need a sample of size

library (samplingbook)
(ss <- sample.size.mean(e=5000, S=s_x, N=nrow(pop), level=0.95))

sample.size.mean object: Sample size for mean estimate
With finite population correction: N=16182, precision e=5000 and standard deviation S=1444230

Sample size needed: 15405

We would normally estimate the total turnover rather than the mean, but if the population size is known without error (a reasonable assumption), there is only a difference of a factor of \( N \), so calculations for the mean suffice
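To spell out the factor-\( N \) argument, a minimal sketch (using the mean and a standard error in the spirit of this example; the numbers are illustrative):

```r
# The total estimate and its standard error are both N times those of the
# mean, so the relative precision is identical for mean and total
N <- 16182
mean_hat <- 133638.3      # estimated mean (illustrative)
se_mean  <- 2549.764      # its standard error (illustrative)
total_hat <- N * mean_hat # estimated total
se_total  <- N * se_mean  # its standard error
c(se_total / total_hat, se_mean / mean_hat)  # identical relative precision
```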

We confirm the absurdly large sample size by calculating the expected standard error by hand

n <- ss$n
N <- nrow (pop)
(se_x <- sqrt (1-n/N) * s_x / sqrt (n))
[1] 2549.764

A 95% confidence interval indeed has (half) width 5,000

1.96 * se_x
[1] 4997.538

Stratification

With simple random sampling we need a very large sample because of the extremely skewed population. The solution is stratification by size.

We would normally also stratify by industry class, for two reasons: 1) the effect of size varies by industry class, and 2) it makes estimation at the industry level easier
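A toy calculation (made-up strata, not the survey data) illustrates why stratification by size pays off: once the few huge units sit in a take-all stratum, only the within-stratum spread contributes to the variance:

```r
# Stratified SRS standard error of the mean, with finite population correction
Nh <- c(900, 90, 10)    # many small, few huge units (hypothetical)
sh <- c(3, 30, 3000)    # within-stratum standard deviations (hypothetical)
nh <- c(90, 30, 10)     # allocation; the huge stratum is taken completely
N  <- sum(Nh)
se_st <- sqrt(sum(Nh^2 * (1 - nh / Nh) * sh^2 / nh)) / N
se_st   # the take-all stratum contributes nothing (its fpc term is 0)
```

Despite a within-stratum SD of 3000 in the last stratum, its take-all fraction zeroes out its variance contribution.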

Package options:

  • SamplingStrata: Optimal Stratification of Sampling Frames for Multipurpose Sampling Surveys
  • stratification: Univariate Stratification of Survey Populations

However, for now we choose a very simple solution

pop$size <- cut (pop$employees,
                 c(10,100,1000,1e7),
                 right=FALSE,
                 include.lowest = TRUE,
                 labels=c("10-99","100-999","1000+"))

The result of a stratification by two factors can be seen immediately. Note that we have two factors with 7 and 3 levels respectively, giving 21 possible strata, but only 20 are non-empty (there are no real estate companies with 1000+ employees)

table (pop$industry, pop$size)

                                10-99 100-999 1000+
  Manufacturing                  3006     483    28
  Construction                   2878      84     9
  Trade and transport            5778     417    26
  Information and communication   836      94    10
  Financial and insurance          60      44     9
  Real estate                     206      16     0
  Other business services        2033     153    12

Stratum variable

It is very often convenient to create a stratum variable, even though many functions have a formula-type argument for the stratum

We number each stratum in a straightforward manner

Strata <- unique (pop[,c("industry", "size")])
Strata <- Strata[order(Strata$industry, Strata$size),]
rownames(Strata) <- NULL
Strata$stratum <- 1:dim(Strata)[1]
Strata <- Strata[c(3,1,2)]

Stratum data frame

Strata
   stratum                      industry    size
1        1                 Manufacturing   10-99
2        2                 Manufacturing 100-999
3        3                 Manufacturing   1000+
4        4                  Construction   10-99
5        5                  Construction 100-999
6        6                  Construction   1000+
7        7           Trade and transport   10-99
8        8           Trade and transport 100-999
9        9           Trade and transport   1000+
10      10 Information and communication   10-99
11      11 Information and communication 100-999
12      12 Information and communication   1000+
13      13       Financial and insurance   10-99
14      14       Financial and insurance 100-999
15      15       Financial and insurance   1000+
16      16                   Real estate   10-99
17      17                   Real estate 100-999
18      18       Other business services   10-99
19      19       Other business services 100-999
20      20       Other business services   1000+

Merge stratum number onto pop

pop <- merge (pop, Strata, by=c("industry", "size"))
head (pop)
      industry  size       id employees turnover_2015     region stratum
1 Construction 10-99 10002103        12         33458    Zealand       4
2 Construction 10-99 10007933        75            NA Central DK       4
3 Construction 10-99 10000600        45            NA    Capital       4
4 Construction 10-99 10007813        10         17194    Zealand       4
5 Construction 10-99 10002056        15            NA   North DK       4
6 Construction 10-99 10004287        32         66467    Zealand       4

We now want two summary statistics at stratum level: the size of each stratum (Nh) and the standard deviation of last year's turnover (sh)

We calculate both by stratum using aggregate()

Nh <- aggregate (pop$stratum,
                 list(stratum=pop$stratum),
                 length)
names(Nh)[2] <- "Nh"

sh <- aggregate (pop$turnover_2015,
                 list(stratum=pop$stratum),
                 sd, na.rm=TRUE)
names(sh)[2] <- "sh"

Strata <- merge.data.frame(Strata, Nh, by="stratum")
Strata <- merge.data.frame(Strata, sh, by="stratum")

These steps can probably be accomplished by very few lines using clever programming…
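For instance, both statistics can come out of a single aggregate() call whose summary function returns a named vector; a sketch on a tiny hypothetical data frame (not the survey data):

```r
# Compute Nh and sh per stratum in one aggregate() call
toy <- data.frame(stratum = c(1, 1, 2, 2),
                  turnover_2015 = c(10, 30, 5, 15))
A <- aggregate(turnover_2015 ~ stratum, data = toy,
               FUN = function(x) c(Nh = length(x), sh = sd(x)))
A   # the two statistics land in a matrix column named 'turnover_2015'
```

One caveat: the formula interface drops rows with missing turnover by default (na.action = na.omit), so with NAs in the data Nh would undercount; the two separate calls above avoid that.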

Strata
   stratum                      industry    size   Nh          sh
1        1                 Manufacturing   10-99 3006   220529.02
2        2                 Manufacturing 100-999  483  1450695.67
3        3                 Manufacturing   1000+   28  5853371.52
4        4                  Construction   10-99 2878    44177.32
5        5                  Construction 100-999   84   327835.73
6        6                  Construction   1000+    9   885385.85
7        7           Trade and transport   10-99 5778   255558.52
8        8           Trade and transport 100-999  417  1858479.04
9        9           Trade and transport   1000+   26 31653522.40
10      10 Information and communication   10-99  836    89874.69
11      11 Information and communication 100-999   94   347359.55
12      12 Information and communication   1000+   10  2797767.20
13      13       Financial and insurance   10-99   60   221949.99
14      14       Financial and insurance 100-999   44   267252.77
15      15       Financial and insurance   1000+    9   867862.30
16      16                   Real estate   10-99  206    75445.11
17      17                   Real estate 100-999   16   102247.21
18      18       Other business services   10-99 2033    72893.30
19      19       Other business services 100-999  153   376541.99
20      20       Other business services   1000+   12  1621369.72

Allocation

We now need to find an allocation, i.e. how many units should be selected in each stratum (nh)

\[ n_h = n \frac{N_h s_h}{\sum _{h=1}^H (N_h s_h)} \]
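The formula (optimal, or Neyman, allocation) is simple enough to compute directly; a sketch with made-up Nh and sh:

```r
# Optimal (Neyman) allocation: sample more where strata are big and variable
Nh <- c(100, 50, 10)   # stratum sizes (hypothetical)
sh <- c(10, 40, 200)   # stratum standard deviations (hypothetical)
n  <- 60               # total sample size
nh <- n * Nh * sh / sum(Nh * sh)
nh
# [1] 12 24 24
```

The small, highly variable third stratum gets as many units as the large, homogeneous first one.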

A seemingly relevant function is stratasamp() from the samplingbook package

n <- 5000
(allo <- stratasamp (n, Nh=Strata$Nh, Sh=Strata$sh, type="opt"))

Stratum   1   2   3   4  5 6    7   8   9 10 11 12 13 14 15 16 17  18 19
Size    640 677 158 123 27 8 1426 749 795 73 32 27 13 11  8 15  2 143 56

Stratum 20
Size    19

But...

…in many cases the optimal number exceeds the number of available units in the stratum

Strata$nh <- allo[2,]
rm (allo)
(TooMany <- (Strata$nh > Strata$Nh))
 [1] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
[12]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
sum (TooMany)
[1] 6

In these cases we select all units (a 'take-all' stratum)

Strata$nh[TooMany] <- Strata$Nh[TooMany]
Strata$nh[!TooMany] <- NA

And the remaining sample is allocated using the same function (note that without type="opt" it uses its default proportional allocation, hence the near-constant sampling fractions below)

n_rem <- n - sum(Strata$nh, na.rm=TRUE)
s <- Strata [!TooMany, ]
(allo <- stratasamp (n_rem, Nh=s$Nh, Sh=s$sh))

Stratum   1   2  3 4    5   6  7  8  9 10 11 12  13 14
Size    795 762 22 2 1529 221 25 16 12  2 55  4 538 40

Now things are ok

s$nh <- allo[2,]
sum (s$nh > s$Nh)
[1] 0

And we have an allocation

Strata <- rbind (Strata[TooMany,], s)
sum (Strata$nh)
[1] 4999

Tidy up a bit and calculate the sampling fractions

Strata <- Strata[order(Strata$stratum),]
Strata$f <- Strata$nh / Strata$Nh
   stratum                      industry    size   Nh   nh         f
1        1                 Manufacturing   10-99 3006  795 0.2644711
2        2                 Manufacturing 100-999  483  483 1.0000000
3        3                 Manufacturing   1000+   28   28 1.0000000
4        4                  Construction   10-99 2878  762 0.2647672
5        5                  Construction 100-999   84   22 0.2619048
6        6                  Construction   1000+    9    2 0.2222222
7        7           Trade and transport   10-99 5778 1529 0.2646244
8        8           Trade and transport 100-999  417  417 1.0000000
9        9           Trade and transport   1000+   26   26 1.0000000
10      10 Information and communication   10-99  836  221 0.2643541
11      11 Information and communication 100-999   94   25 0.2659574
12      12 Information and communication   1000+   10   10 1.0000000
13      13       Financial and insurance   10-99   60   16 0.2666667
14      14       Financial and insurance 100-999   44   12 0.2727273
15      15       Financial and insurance   1000+    9    2 0.2222222
16      16                   Real estate   10-99  206   55 0.2669903
17      17                   Real estate 100-999   16    4 0.2500000
18      18       Other business services   10-99 2033  538 0.2646335
19      19       Other business services 100-999  153   40 0.2614379
20      20       Other business services   1000+   12   12 1.0000000

Now it is easy to select the sample:

library (sampling)
set.seed (26743583)
pop <- pop[order(pop$stratum),]
s <- sampling::strata (pop,
                       stratanames="stratum",
                       size=Strata$nh,
                       method="srswor")

Finally extract the sample and save it (together with population and stratum level data)

MySample <- getdata (pop, s)
save (MySample, Strata, pop, file="data/41_sampling.RData")

GSBPM 5.3

Checking input validity with validate

Load the package and data

library(validate)
data(mtcars)

Introduce errors

mtcars$am <- sample(0:2, nrow(mtcars), replace = TRUE)
mtcars$mpg <- mtcars$mpg * sample(c(-1, 1), nrow(mtcars), replace = TRUE)

Confront the data with rules

cf <- check_that(mtcars, mpg > 0,
                 cyl %% 1 == 0, am %in% c(0,1))

Quickly get an overview of data state

Plot the results

barplot(cf, main="Errors in the mtcars dataset")

plot of chunk unnamed-chunk-37

Find offending records

Using aggregate

library(magrittr)
cf %>% aggregate(by = "record") %>% head
  npass nfail nNA  rel.pass  rel.fail rel.NA
1     2     1   0 0.6666667 0.3333333      0
2     3     0   0 1.0000000 0.0000000      0
3     1     2   0 0.3333333 0.6666667      0
4     3     0   0 1.0000000 0.0000000      0
5     2     1   0 0.6666667 0.3333333      0
6     3     0   0 1.0000000 0.0000000      0

Managing rules with validators

Predefine rules

v1 <- validator(am >= 0, cyl < 5)
confront(mtcars, v1)
Object of class 'validation'
Call:
    confront(x = mtcars, dat = v1)

Confrontations: 2
With fails    : 1
Warnings      : 0
Errors        : 0

Getting rules from files

Create rules with YAML

writeLines(readLines("data/rules.yaml"))
rules :
-
 expr : am %in% c(0,1)
 name : gearModes
 description : Either automatic or not
-
 expr : mpg > 10
 name : minimumMPG
 description : EU rules dictate minimum mpg
---

Rules from files cont.

validator(.file = "data/rules.yaml") %>% confront(mtcars, .)
Object of class 'validation'
Call:
    confront(x = mtcars, dat = .)

Confrontations: 2
With fails    : 2
Warnings      : 0
Errors        : 0

GSBPM 5.4

Systematic errors

Some errors are believed to be systematic (i.e. we think we know the process that caused the error). Once the mechanism is known, the correction is obvious.

Some of these errors can be detected and corrected with R.

Frequently seen systematic errors

  • typos
  • sign errors
  • missing zeros
  • switched characters
  • switched fields

A sample data set

dat <- data.frame(
  Turnover = c(1000, 353),
  Cost = c(80, 283),
  Profit = c(200, 115))
dat
  Turnover Cost Profit
1     1000   80    200
2      353  283    115

The first observation is obviously erroneous, and there is something wrong with the second as well

Introducing edits

Edits can be hard coded

library (editrules)
E <- editmatrix("Turnover - Cost == Profit")
E
Edit matrix:
     Cost Profit Turnover Ops CONSTANT
num1   -1     -1        1  ==        0

Edit rules:
num1 : Turnover == Cost + Profit  

Correct data with deducorrect

Load package deducorrect and use function correctTypos()

library (deducorrect)
res <- correctTypos (E, dat)

Result can be viewed like this

res$corrected
  Turnover Cost Profit
1     1000  800    200
2      353  238    115
dat
  Turnover Cost Profit
1     1000   80    200
2      353  283    115
res$corrections
  row variable old new
1   1     Cost  80 800
2   2     Cost 283 238

The first correction was pretty straightforward - the second was harder to detect…

Another example

Here we have four different business areas with corresponding revenue and costs:

dat <- data.frame(
  x0 = c(195, 610),
  x1.r = c(2100, 5100),
  x1.c = c(1950, 4650),
  x1 = c(150, 450),
  x2.r = c(0, 0),
  x2.c = c(10, 130),
  x2 = c(10, 130),
  x3.r = c(20, 20),
  x3.c = c(5, 0),
  x3 = c(15, 20),
  x4.r = c(50, 15),
  x4.c = c(10, 25),
  x4 = c(40, 10))
dat
   x0 x1.r x1.c  x1 x2.r x2.c  x2 x3.r x3.c x3 x4.r x4.c x4
1 195 2100 1950 150    0   10  10   20    5 15   50   10 40
2 610 5100 4650 450    0  130 130   20    0 20   15   25 10

The rules are straightforward

  • total profit is sum of four business areas
  • for each area, profit is revenue minus cost
E <- editmatrix(c("x0 == x1 + x2 + x3 + x4",
                  "x1 == x1.r - x1.c",
                  "x2 == x2.r - x2.c",
                  "x3 == x3.r - x3.c",
                  "x4 == x4.r - x4.c"))

Confronting the data reveals a number of errors:

E
Edit matrix:
     x0 x1 x2 x3 x4 x1.c x1.r x2.c x2.r x3.c x3.r x4.c x4.r Ops CONSTANT
num1  1 -1 -1 -1 -1    0    0    0    0    0    0    0    0  ==        0
num2  0  1  0  0  0    1   -1    0    0    0    0    0    0  ==        0
num3  0  0  1  0  0    0    0    1   -1    0    0    0    0  ==        0
num4  0  0  0  1  0    0    0    0    0    1   -1    0    0  ==        0
num5  0  0  0  0  1    0    0    0    0    0    0    1   -1  ==        0

Edit rules:
num1 : x0 == x1 + x2 + x3 + x4 
num2 : x1 + x1.c == x1.r 
num3 : x2 + x2.c == x2.r 
num4 : x3 + x3.c == x3.r 
num5 : x4 + x4.c == x4.r  
(violatedEdits(E, dat))
      edit
record  num1  num2 num3  num4  num5
     1  TRUE FALSE TRUE FALSE FALSE
     2 FALSE FALSE TRUE FALSE  TRUE

There is even a plot method for violated edits

plot (violatedEdits(E, dat)) 

plot of chunk unnamed-chunk-51

Corrections of sign errors by “flipping” the sign

dat
   x0 x1.r x1.c  x1 x2.r x2.c  x2 x3.r x3.c x3 x4.r x4.c x4
1 195 2100 1950 150    0   10  10   20    5 15   50   10 40
2 610 5100 4650 450    0  130 130   20    0 20   15   25 10
150+10+15+40
[1] 215
res1 <- correctSigns (E, dat) 
res1$corrected 
   x0 x1.r x1.c  x1 x2.r x2.c  x2 x3.r x3.c x3 x4.r x4.c x4
1 195 2100 1950 150    0   10 -10   20    5 15   50   10 40
2 610 5100 4650 450    0 -130 130   20    0 20  -15  -25 10
res1$corrections
  row variable old  new
1   1       x2  10  -10
2   2     x2.c 130 -130
3   2     x4.c  25  -25
4   2     x4.r  15  -15

Were the corrections correct?

dat
   x0 x1.r x1.c  x1 x2.r x2.c  x2 x3.r x3.c x3 x4.r x4.c x4
1 195 2100 1950 150    0   10  10   20    5 15   50   10 40
2 610 5100 4650 450    0  130 130   20    0 20   15   25 10
res1$corrected 
   x0 x1.r x1.c  x1 x2.r x2.c  x2 x3.r x3.c x3 x4.r x4.c x4
1 195 2100 1950 150    0   10 -10   20    5 15   50   10 40
2 610 5100 4650 450    0 -130 130   20    0 20  -15  -25 10

Looking at the numbers indicates that in row 2, x4.c and x4.r were probably not two sign errors but one error with swapped fields.

We create a list of variable pairs that are allowed to be swapped

dat
   x0 x1.r x1.c  x1 x2.r x2.c  x2 x3.r x3.c x3 x4.r x4.c x4
1 195 2100 1950 150    0   10  10   20    5 15   50   10 40
2 610 5100 4650 450    0  130 130   20    0 20   15   25 10
PairList = list (c("x1.r","x1.c"), c("x2.r","x2.c"), c("x3.r","x3.c"), c("x4.r","x4.c"))

We now allow this type of error to be detected

res2 <- correctSigns (E, dat, swap = PairList) 
res2$corrected 
   x0 x1.r x1.c  x1 x2.r x2.c  x2 x3.r x3.c x3 x4.r x4.c x4
1 195 2100 1950 150    0   10 -10   20    5 15   50   10 40
2 610 5100 4650 450    0 -130 130   20    0 20   25   15 10
res2$corrections 
  row variable old  new
1   1       x2  10  -10
2   2     x2.c 130 -130
3   2     x4.c  25   15
4   2     x4.r  15   25
res1$status 
     status weight degeneracy nflip nswap
1 corrected      1          1     1     0
2 corrected      3          1     3     0
res2$status 
     status weight degeneracy nflip nswap
1 corrected      1          1     1     0
2 corrected      2          2     1     1

GSBPM 5.6

Estimation from finite populations

Read the data again

load(file="data/41_sampling.RData")

By magic attach survey response (with 100% response rate!)

load(file="data/resp.RData")
sampresp <- merge (MySample, resp, by="id", all.x = TRUE, all.y = FALSE)
sampresp [,c("ID_unit", "Stratum", "turnover_2015")] <- NULL
sampresp <- merge.data.frame(sampresp, Strata[,c("stratum","Nh")], by="stratum")
head (sampresp)
  stratum       id      industry  size employees      region      Prob
1       1 10000004 Manufacturing 10-99        15  Central DK 0.2644711
2       1 10000006 Manufacturing 10-99        27    North DK 0.2644711
3       1 10006984 Manufacturing 10-99        13 Southern DK 0.2644711
4       1 10015258 Manufacturing 10-99        48     Zealand 0.2644711
5       1 10000015 Manufacturing 10-99        38  Central DK 0.2644711
6       1 10000026 Manufacturing 10-99        26  Central DK 0.2644711
  turnover_2016   Nh
1         14223 3006
2         35050 3006
3          6729 3006
4         73836 3006
5         46489 3006
6        272710 3006

Enter survey package

Load package and create a survey object

library (survey)
stsi.design <- svydesign(id=~1,
                         strata=~stratum,
                         fpc=~Nh,
                         data=sampresp)

Estimation of survey total

(ty_hat <- svytotal(~turnover_2016, stsi.design))
                   total       SE
turnover_2016 1849924516 21804593

Compare with the true total in the population (in billion DKK)

(ty <- sum(resp$turnover_2016)/1e6)
[1] 1883.121

Coefficient of variation (in percent)

cv(ty_hat)*100
              turnover_2016
turnover_2016      1.178675

… or a 95% confidence interval in billion DKK (approximately +/- 1.96 standard errors)

confint(ty_hat)/1e6
                 2.5 %   97.5 %
turnover_2016 1807.188 1892.661

… or +/- 1 estimated standard error (a 68% confidence interval)

confint(ty_hat, level=0.68)/1e6
                  16 %     84 %
turnover_2016 1828.241 1871.608

Domain estimate by region

ty_region_hat <- svyby(~turnover_2016, ~region, stsi.design, svytotal)
ty_region_hat$cv <- ty_region_hat$se / ty_region_hat$turnover_2016 * 100
ty_region_hat
                 region turnover_2016       se       cv
North DK       North DK     110341169  5288767 4.793104
Central DK   Central DK     407435506  7796006 1.913433
Southern DK Southern DK     347020018 11325539 3.263656
Capital         Capital     884623135 20275801 2.292027
Zealand         Zealand     100504688  4845114 4.820784

Sum of domain estimates consistent with total

sum(ty_region_hat$turnover_2016)
[1] 1849924516
ty_hat
                   total       SE
turnover_2016 1849924516 21804593

GSBPM 6.1

Purpose of seasonal adjustment

The purpose of seasonal adjustment is to facilitate month-to-month comparisons (all else being equal).

We do that by decomposing an observed time series into unobserved components using either an additive model \[ X_t = T_t + S_t + I_t \] or a multiplicative model \[ X_t = T_t \cdot S_t \cdot I_t \] where

  • \( T_t \) is the trend
  • \( S_t \) is the seasonality
  • \( I_t \) is the remainder
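A minimal sketch of the additive model on a simulated monthly series (not the unemployment data), using base R's decompose():

```r
# Simulate trend + seasonality + noise, then recover the components
set.seed(1)
t_ <- 1:120
x <- ts(100 + 0.5 * t_ + 10 * sin(2 * pi * t_ / 12) + rnorm(120, sd = 2),
        frequency = 12, start = c(2007, 1))
dec <- stats::decompose(x, type = "additive")
# by construction x = T_t + S_t + I_t wherever the trend is defined
range(na.omit(x - (dec$trend + dec$seasonal + dec$random)))
```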

Read some sample data

Read the data

load(file="data/unemp.RData")
str (unemp)
 Time-Series [1:120] from 2007 to 2017: 130100 126542 121465 110681 102920 ...
frequency(unemp)
[1] 12
plot(unemp, col="blue", ylab="Gross unemployment [no. persons]", main="Monthly gross unemployment")

plot of chunk unnamed-chunk-71

Illustrate seasonal pattern

Reshape ts object as data frame and add columns for year and month

library(lubridate)
library (zoo)

X <- data.frame(unemp=as.matrix(unemp), date=as.Date(as.yearmon(time(unemp))))
X$year <- as.ordered(year(X$date))
X$month <- as.ordered(month(X$date))
head (X)
   unemp       date year month
1 130100 2007-01-01 2007     1
2 126542 2007-02-01 2007     2
3 121465 2007-03-01 2007     3
4 110681 2007-04-01 2007     4
5 102920 2007-05-01 2007     5
6  98808 2007-06-01 2007     6
library (ggplot2)
ggplot(data=X, aes(x=month, y=unemp, group=year, colour=year)) +
 geom_line() +
 geom_point()

plot of chunk unnamed-chunk-73

Simple adjustments

There are many possibilities in base R, like decompose() and stl(), both from the stats package

plot (stats::decompose(unemp,"multiplicative"))

plot of chunk unnamed-chunk-74

plot (stl (unemp, "periodic"), main="Decomposition using function 'stl'")

plot of chunk unnamed-chunk-75

More complicated

tbats from the forecast package is somewhat more complicated (exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components…)

library (forecast)
plot (tbats (unemp))

plot of chunk unnamed-chunk-76

All three give similar results

Seasonally adjust using the forecast::seasadj function:

plot of chunk unnamed-chunk-77

X-13ARIMA-SEATS

If more control or options are needed, we turn to X-13ARIMA-SEATS (or its predecessor X-12-ARIMA).

We need the ability to add regressors to account for

  • calendar effects (length of month, differences in week days, holidays including Easter)
  • outliers (additive outliers, level shifts, temporary changes)

These are permanent and temporary adjustments, respectively

The program X-13ARIMA-SEATS is from the US Census Bureau (“Statistics US”). It's a single executable that works through input and output of text files.

The package x12 from Statistics Austria provides many functions for running X-13ARIMA-SEATS from within R

library (x12)

First create an x12 object

unemp_x12s <- new ("x12Single", ts=unemp, tsName="Gross_unemployment")

The binaries must be installed on the machine

x12path(file.path("c:/data/dst/x13as","x13as.exe"))

Then it is simple to run with default settings

s <- x12 (unemp_x12s)

We can get a lot of information about the seasonal adjustment model using the built-in summary function:

summary(s)
--------------------------   Gross_unemployment   ------------------------------------
-----------------------------------------------------------------------------------

    Time Series

Frequency: 12 
Span: 1st month,2007 to 12th month,2016 

    Model Definition

ARIMA Model: (1 1 0)(0 1 1) (Automatic Model Choice)
Model Span:  
Transformation: Automatic selection : No transformation 
Regression Model: none 

    Outlier Detection

No outlier detection performed

    Seasonal Adjustment

Identifiable Seasonality: yes 
Seasonal Peaks: none 
Trading Day Peaks: none 
Overall Index of Quality of SA
(Acceptance Region from 0 to 1)
Q: 0.27 
Number of M statistics outside the limits: 0 

SA decomposition: additive 
Moving average used to estimate the seasonal factors: 3x3 
Moving average used to estimate the final trend-cycle: 9-term Henderson filter

The various components of the decomposed series are contained in the x12-object and are easily retrieved:

plot(s@x12Output@a1, col="blue", ylab="Gross unemployment [no. persons]")
lines(s@x12Output@d11, col="red")
legend ("bottomright", c("Original", "Seasonally adjusted"),
        col=c("blue", "red"), lty=c(1,1))

plot of chunk unnamed-chunk-83

A spectral plot can be obtained using the built-in plotSpec function:

plotSpec(s)

plot of chunk unnamed-chunk-84

Similarly, the seasonal factors can be plotted using the built-in plotSeasFac function:

plotSeasFac(s)

plot of chunk unnamed-chunk-85

And finally, an ACF-plot is obtained using the plotRsdAcf function:

plotRsdAcf(s)

plot of chunk unnamed-chunk-86

GSBPM 6.4

GSBPM 6.4 Apply disclosure control

  • Packages sdcTable and sdcMicro
  • We will look at this some other time :-)

GSBPM 7.2

Example 1

Example 2

And now for something slightly different

Link

R environment in Statistics Denmark

  • Local R restricted to Methodology and Analysis and only for testing packages
  • Centralized, server-based
    • High capacity, CPU and memory
    • Better manageability
    • Secure, audited access to confidential data
  • Server specifications
    • Windows Server 2012 R2
    • 20 Intel Xeon E5-2660 v3 cores @ 2.6 GHz
    • 384 GB main memory

R environment in Statistics Denmark

  • 40 users have access to R
  • Normally less than 10 connected simultaneously
  • Windows terminal server RemoteApp to RStudio
    • Seamless integration into client desktop
    • No cut/paste between client and remote R for security reasons

Link

Access to confidential data

  • R server has only a few paths to data
    • Subversion access for exchange of program source code
    • Access to data in Oracle Database
    • User access to data in Oracle granted/revoked through existing time-proven processes
    • All access to data in Oracle audited as required by data protection regulation
    • Output (e.g. PDF) transferred through a homemade (audited) pigeonhole app

Access paths to/from R environment

Link

Management of R environment 1/3

  • Currently
    • R 3.3.2 x64
    • RStudio 1.0.44
    • ROracle 1.3.1 Oracle Interface for R (based on DBI)
  • R upgraded during working hours without disrupting operations
    • Older releases of R left untouched
    • Install new R, quickly change the default R back to the previous release (new - 1)
    • Install packages on new R
    • Change default R to new R
    • Done

Management of R environment 2/3

  • R packages strictly managed
  • Whitelist of approved packages maintained by Methodology and Analysis
  • Installed by script using miniCRAN
  • Pseudocode for building the miniCRAN repository
    • Install miniCRAN on local R
    • Build the list of packages from the whitelist using pkgDep, including dependencies and suggestions
    • Build the repository using makeRepo (source and binary)
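The build pseudocode might look like this in R (the whitelist and repository path are hypothetical; this downloads packages from CRAN):

```r
# Build a local miniCRAN repository covering a whitelist plus all
# dependencies and suggested packages
library(miniCRAN)
whitelist <- c("survey", "validate", "editrules")   # hypothetical whitelist
cran <- "https://cran.r-project.org"
pkgs <- pkgDep(whitelist, repos = cran, suggests = TRUE)
makeRepo(pkgs, path = "C:/data/miniCRAN",           # hypothetical path
         repos = cran, type = c("source", "win.binary"))
```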

Management of R environment 3/3

  • Copy the newly built miniCRAN repository to the R server
  • Pseudocode for installing from the miniCRAN repository
    • install.packages from the miniCRAN repository
    • lapply (pkgs, require, character.only = TRUE) to verify loadability of the packages
  • miniCRAN scripts available to anyone interested
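And the installation side, a sketch assuming the repository has been copied to the server (the repository path and package names are hypothetical):

```r
# Install everything from the local miniCRAN repository, then verify that
# each package actually loads
repo <- "file:///C:/data/miniCRAN"            # hypothetical local repository
pkgs <- c("survey", "validate", "editrules")  # hypothetical whitelist
install.packages(pkgs, repos = repo)
ok <- sapply(pkgs, require, character.only = TRUE)
stopifnot(all(ok))   # fail loudly if any package does not load
```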

Thanks for your attention

Questions are welcome!